Day 2

Uroš Godnov

String manipulation

Putting strings together with stringr

  • str_c()
  • the c is short for concatenate; the function works like paste()
str_c("Beautiful","day", sep=" ")
[1] "Beautiful day"
str_c("Beautiful",NA, sep=" ")
[1] NA
paste("Beautiful",NA, sep=" ") #base R
[1] "Beautiful NA"

Length

nchar(c("Bruce", "Wayne"))
[1] 5 5
stringr::str_length(c("Bruce", "Wayne"))
[1] 5 5
#factors
f<-factor(c("good","good", "moderate", "bad"))
nchar(f) #nchar throws error
Error in nchar(f): 'nchar()' requires a character vector
stringr::str_length(f)
[1] 4 4 8 3

Extracting substrings

  • str_sub()
  • extracts parts of strings based on their location
  • first argument, string, is a vector of strings
  • the arguments start and end specify the boundaries of the piece to extract in characters
  • both start and end can be negative integers, in which case, they count from the end of the string
str_sub(c("Bruce", "Wayne"), 1, 4)
[1] "Bruc" "Wayn"
str_sub(c("Bruce", "Wayne"), -3, -1)
[1] "uce" "yne"

Matches

  • str_detect(): answers the question: does the string contain the pattern?
  • str_subset(): returns the strings that contain a match
  • str_count(): counts the matches
pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

str_detect(pizzas, pattern = "pepper")
[1] FALSE  TRUE  TRUE
str_subset(pizzas, pattern = fixed("pepper"))
[1] "pepperoni"                 "sausage and green peppers"
str_count(pizzas, pattern = fixed("pepper"))
[1] 0 1 1
str_count(pizzas, pattern = fixed("e"))
[1] 3 2 5

Parsing strings into variables

str_split(): pull apart raw string data into more useful variables

date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")

split_dates <- str_split(date_ranges, pattern = fixed(" - "))

split_dates
[[1]]
[1] "23.01.2017" "29.01.2017"

[[2]]
[1] "30.01.2017" "06.02.2017"

Replacing matches in strings

ids <- c("ID#: 192", "ID#: 118", "ID#: 001")

# Replace "ID#: " with ""
id_nums <- str_replace(ids, "ID#: ", "")

id_nums
[1] "192" "118" "001"
phone_numbers <- c("510-555-0123", "541-555-0167")

str_replace_all(phone_numbers, "-", ".")
[1] "510.555.0123" "541.555.0167"

Lab

  • open names.txt and copy the content
  • you’ll turn a vector of full names, like “Bruce Wayne”, into abbreviated names like “B. Wayne”. This requires combining str_split(), str_sub() and str_c().
  • do the task using str_split() with simplify = TRUE
  • calculate how many names end with a, h, s and e.
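One possible sketch for this lab (the input vector below is illustrative; read the real one from names.txt):

```r
library(stringr)

# Illustrative stand-in for the contents of names.txt
full_names <- c("Bruce Wayne", "Clark Kent", "Diana Prince")

# simplify = TRUE returns a matrix: first names in column 1, surnames in column 2
parts <- str_split(full_names, pattern = fixed(" "), simplify = TRUE)

# Abbreviate: first letter of the first name + ". " + surname
abbreviated <- str_c(str_sub(parts[, 1], 1, 1), ". ", parts[, 2])
abbreviated
## [1] "B. Wayne"  "C. Kent"   "D. Prince"

# How many names end with a, h, s or e?
sum(str_detect(full_names, "[ahse]$"))
```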

Regular expressions (Advanced)

Why Regular expressions

  • pattern matching is the most common task when working with strings
  • extracting one piece of text from another
  • which number is in the following string: "10202"?
  • is there a number in the string "102a"? What about in "1O2"?
  • we would like to separate the string "2,32.1,0.4" into 3 columns!

Regular expressions

  • syntax to describe patterns
  • functions on patterns

grep

  • grep() function from base R
  • sub() for replacement
  • the stringr package
  • grep = globally search for a regular expression and print. Is there a pattern in a string?
string <- "car"
pattern <- "car"
grep(pattern, string)
[1] 1
string <- c("car", "cars", "in a car", "truck", "car's trunk")
pattern <- "car"
grep(pattern, string)
[1] 1 2 3 5
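The sub() function from the bullets above replaces only the first match in each element, while gsub() replaces all of them; a quick base-R illustration:

```r
string <- c("car", "cars", "in a car", "truck", "car's trunk")

# sub() replaces the first match per element; unmatched elements pass through
sub("car", "bus", string)
## [1] "bus"         "buss"        "in a bus"    "truck"       "bus's trunk"

# gsub() replaces every match
gsub("c", "k", "cc")
## [1] "kk"
```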

grepl

  • grepl - returns a logical vector
string <- c("car", "cars", "in a car", "truck", "car's trunk")
pattern <- "car"
grepl(pattern, string)
[1]  TRUE  TRUE  TRUE FALSE  TRUE

Meta and special characters

  • special characters: . \ | ( ) [ { ^ $ * + ?
  • \ - escape character
  • . - any single character
  • ^ - beginning of a string
  • $ - end of string
string <- c("car", "cars", "in a car", "truck", "car's trunk")
grep("^c.r",string)
[1] 1 2 5
grep("^c..$",string)
[1] 1
grep("^c.r.$",string)
[1] 2

Alphanumeric character

  • \w - matches a word character (letter, digit, or underscore)
grep("\\w", c(" ", "a", "1", "A", "%", "\t"))
[1] 2 3 4
grepl("\\w", c(" ", "a", "1", "A", "%", "\t"))
[1] FALSE  TRUE  TRUE  TRUE FALSE FALSE

Non-alphanumeric character

  • \W - matches a non-word character
grep("\\W", c(" ", "a", "1", "A", "%", "\t"))
[1] 1 5 6
grepl("\\W", c(" ", "a", "1", "A", "%", "\t"))
[1]  TRUE FALSE FALSE FALSE  TRUE  TRUE

Whitespace

  • \s - matches a whitespace character
grep("\\s", c(" ", "a", "1", "A", "%", "\t"))
[1] 1 6

Non-whitespace

  • \S - matches a non-whitespace character
grep("\\S", c(" ", "a", "1", "A", "%", "\t"))
[1] 2 3 4 5

Digit

  • \d - matches a digit
grep("\\d", c(" ", "a", "1", "A", "%", "\t"))
[1] 3

Non-digit

  • \D - matches a non-digit
grep("\\D", c(" ", "a", "1", "A", "%", "\t"))
[1] 1 2 4 5 6

Possible values for a character

  • using []
grep("^[abc]\\w\\w", c("car", "bus", "no", "cars"))
[1] 1 2 4
grep("^[abc]\\w\\w$", c("car", "bus", "no", "cars"))
[1] 1 2

All three-letter lowercase words

grep("^[a-z][a-z][a-z]$", c("Car", "Cars", 
"cars","car", "no", "three:", "tic", "tac"))
[1] 4 7 8

One or two digits anywhere

  • | - alternation (or)
  • () - grouping
grep("((\\d)|([1-9]\\d))", c("1", "20", "0", "zero", "it is 100%", "09"))
[1] 1 2 3 5 6

Exactly one or two digits

grep("^((\\d)|([1-9]\\d))$", c("1", "20", "0", "zero", "it is 100%", "09"))
[1] 1 2 3
grep("^((\\d)|(\\d\\d))$", c("1", "20", "0", "zero", "it is 100%", "09"))
[1] 1 2 3 6

Repeating (1)

  • repeating operators apply to the last character or group
  • ? - matches at most 1 time
  • * - matches 0 or more times
  • + - matches 1 or more times
  • {m} - matches exactly m times
  • {m,n} - matches between m and n times
  • {m,} - matches at least m times
string <- c("a", "ab", "acb", "accb", "acccb", "accccb")
grep("ac*b", string)
[1] 2 3 4 5 6
grep("ac+b", string)
[1] 3 4 5 6

Repeating (2)

string <- c("a", "ab", "acb", "accb", "acccb", "accccb")
grep("ac?b", string, value=TRUE)
[1] "ab"  "acb"
grep("ac{2}b", string, value = TRUE)
[1] "accb"
grep("ac{2,}b", string, value = TRUE)
[1] "accb"   "acccb"  "accccb"
grep("ac{2,3}b", string, value = TRUE)
[1] "accb"  "acccb"

Strings containing only lowercase words

grep("^([a-z]+ )*[a-z]+$", c("words", "words or sentences",
    "123 no", "Words"," word","word 123"))
[1] 1 2

Lowercase words of length 3 to 5

grep("^[a-z]{3,5}$", c("words", "words or sentences",
    "123 no", "Words"," word","word 123","hey"))
[1] 1 7

Signed numbers

grep("^[+-]?(0|[1-9][0-9]*)$", c("++0", "+1", "01", "-99"))
[1] 2 4

Greedy and Lazy Repetition

  • the repetition operators (quantifiers) are greedy by default
  • how do we make them lazy?
  • regexpr
  • regmatches
string<-"This is a <EM>first</EM> test"
pattern<-"<.+>"

r<-regexpr(pattern,string)
regmatches(string, r)
[1] "<EM>first</EM>"
##lazy
pattern<-"<.+?>"
r<-regexpr(pattern,string)
regmatches(string, r)
[1] "<EM>"

gregexpr

  • finds all positions and lengths of matched patterns
text<-"Yesterday I had 100 Euros, today I only have 45 Euros left."
gregexpr("(\\d+)",text)
[[1]]
[1] 17 46
attr(,"match.length")
[1] 3 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

regmatches

  • extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec
text<-"Yesterday I had 100 Euros, today I only have 45 Euros left."

regmatches(text, gregexpr("\\d+",text))
[[1]]
[1] "100" "45" 

Lab

  • Open regular.txt and complete the exercises

Stringr and regular expressions

  • str_subset (grep)
fruit %>% 
  str_subset("\\s")
 [1] "bell pepper"       "blood orange"      "canary melon"     
 [4] "chili pepper"      "goji berry"        "kiwi fruit"       
 [7] "purple mangosteen" "rock melon"        "salal berry"      
[10] "star fruit"        "ugli fruit"       
  • str_detect (grepl)
  • str_extract/str_extract_all (gregexpr+regmatches)
text<-"Yesterday I had 100 Euros, today I only have 45 Euros left."
str_extract(text,"\\d+")
[1] "100"
str_extract_all(text,"\\d+")
[[1]]
[1] "100" "45" 

Text mining (optional)

Data

include_graphics("./Pictures/Data.png")

Why text mining?

  • 85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)
  • unstructured corporate data is doubling in size every 18 months
  • the benefits of text mining are obvious, especially in text-rich environments, e.g.:
  • law (court orders)
  • academic research (research articles)
  • finance (quarterly reports)
  • medicine (discharge summaries)
  • marketing (customer comments)
  • electronic communication records (e.g., email): spam filtering, email prioritization and categorization, automatic response generation

Key steps

1. Collection of text document

2. Pre – processing of text

3. Text mining techniques

4. Analyze the text

5. Knowledge discovery

1. Collection of text document

  • web scraping
  • scanning and OCR
  • internal documents

2. Pre – processing of text

  • tokenization
  • removal of stop words
  • stemming

Preprocessing: tokenize and N-grams

  • N-grams are contiguous sequences of n characters (or words) taken from the source text
include_graphics("./Pictures/ngrams.jpg")

Tokenize - example

text <- c("Great white shark just ate my leg.","Not a wonderful day and days!")
text_df <- tibble(id = 1:2, text = text)

text_df %>%
  unnest_tokens(word, text)
# A tibble: 13 × 2
      id word     
   <int> <chr>    
 1     1 great    
 2     1 white    
 3     1 shark    
 4     1 just     
 5     1 ate      
 6     1 my       
 7     1 leg      
 8     2 not      
 9     2 a        
10     2 wonderful
11     2 day      
12     2 and      
13     2 days     

Removal of stop words

  • the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.)

  • not adding much information to the text

  • examples of a few stop words in English are “the”, “a”, “an”, “so”, “what”,…

  • why remove stop words? To strip the low-level information from our text and give more focus to the important information

  • Do we always remove stop words? NO!

  • Before removing stop words, research a bit about your task and the problem you are trying to solve, and then make your decision!

Removal of stop words

data(stop_words)

text_df %>%
  unnest_tokens(word, text) %>% 
  anti_join(stop_words)
# A tibble: 7 × 2
     id word     
  <int> <chr>    
1     1 white    
2     1 shark    
3     1 ate      
4     1 leg      
5     2 wonderful
6     2 day      
7     2 days     

Stemming and lemmatization

  • Stemming: the process of reducing a word to its stem or root form
  • Lemmatization: a transformation that uses a dictionary to map a word’s variants back to their root form

Stemming and lemmatization - stemming

text_df %>%
  unnest_tokens(word, text) %>% 
  mutate(word=wordStem(word))
# A tibble: 13 × 2
      id word  
   <int> <chr> 
 1     1 great 
 2     1 white 
 3     1 shark 
 4     1 just  
 5     1 at    
 6     1 my    
 7     1 leg   
 8     2 not   
 9     2 a     
10     2 wonder
11     2 dai   
12     2 and   
13     2 dai   

Stemming and lemmatization - lemmatization

text_df %>%
  unnest_tokens(word, text) %>% 
  mutate(word=lemmatize_words(word))
# A tibble: 13 × 2
      id word     
   <int> <chr>    
 1     1 great    
 2     1 white    
 3     1 shark    
 4     1 just     
 5     1 eat      
 6     1 my       
 7     1 leg      
 8     2 not      
 9     2 a        
10     2 wonderful
11     2 day      
12     2 and      
13     2 day      

3. Text mining techniques

Concepts

  • bag of words
  • NLP

Bag of words

  • mostly used technique
  • every word is independent (mostly)
  • stemming/lemmatization (tourist, tourists and tourism may be treated as the same word)
  • word frequency
  • POS
  • sentiment analysis (different lexicons)
  • entities extraction
  • topics identification (e.g. LDA algorithm)
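The word-frequency idea above can be sketched with tidytext, reusing the text_df from the tokenize example (a minimal sketch, not a full pipeline):

```r
library(dplyr)
library(tidytext)

text <- c("Great white shark just ate my leg.", "Not a wonderful day and days!")
text_df <- tibble(id = 1:2, text = text)

# Bag of words: tokenize, drop stop words, count term frequencies
word_counts <- text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

word_counts
```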

NLP

  • uses dictionaries to learn (e.g. Stanford NLP)
  • a subfield of artificial intelligence and computational linguistics: the study of “understanding” natural human language

Demo

Stanford NLP

Sentiment analysis

  • lexicons
  • simple (e.g. Liu & Hu): -1 negative, 0 neutral, +1 positive
  • advanced (e.g. AFINN): scores from -5 to +5

How to do it in R - 1

  • many packages:

    • tm, LDA, textmineR, tidyr, sentimentr,…
data<-read_xlsx("TA_reviews.xlsx")
sentences<- get_sentences(data$fullrev[1:2])
sentences
[[1]]
[1] "The hotel is ideally located and is in a beautiful building."                                                                        
[2] "Most of the staff are very polite and helpful."                                                                                      
[3] "Rooms are comfortable and it has a serviceable gym."                                                                                 
[4] "Avoid going to breakfast before 0700 or wearing flip flops or slippers, you will be admonished and sent back to your room to change."

[[2]]
[1] "The hotel is a short walk to the pedestrian mall, restaurants and cafes."    
[2] "The hotel is an old historical landmark."                                    
[3] "I loved the tall ceilings, lobby and restaurant."                            
[4] "The bathroom has been updated and is very nice."                             
[5] "The breakfast buffet is very good with many options and you can eat outside."
[6] "We enjoyed our stay here."                                                   

attr(,"class")
[1] "get_sentences"           "get_sentences_character"
[3] "list"                   

How to do it in R - 2

dfJR<-sentiment_by(sentences,polarity_dt=lexicon::hash_sentiment_jockers_rinker)
dfSE<-sentiment_by(sentences,polarity_dt=lexicon::hash_sentiment_sentiword)
dfJR
   element_id word_count        sd ave_sentiment
1:          1         52 0.3861717     0.2993681
2:          2         56 0.2501910     0.2914671
dfSE
   element_id word_count        sd ave_sentiment
1:          1         52 0.1763900    0.14263246
2:          2         56 0.1420433    0.02832483

How to do it in R - 3

sentences<-"The great white shark just ate my leg!"

dfJR<-sentiment_by(sentences,polarity_dt=lexicon::hash_sentiment_jockers_rinker)
dfSE<-sentiment_by(sentences,polarity_dt=lexicon::hash_sentiment_sentiword)
dfJR
   element_id word_count sd ave_sentiment
1:          1          8 NA   -0.03535534
dfSE
   element_id word_count sd ave_sentiment
1:          1          8 NA   -0.04787702

How to deal with datetime

Creating date/times - 1

  • the default format for a date is yyyy-mm-dd
  • the default format for a time is hh:mm:ss
  • a date-time is a date plus a time
  • the hms package provides a native class for storing times
date<-"2019-05-05"
time<-"18:51:32"

datetime<-paste(date,time)

as.POSIXct(datetime) ## base R
[1] "2019-05-05 18:51:32 CEST"
as.POSIXlt(datetime) ## base R
[1] "2019-05-05 18:51:32 CEST"
as_datetime(datetime) ## lubridate
[1] "2019-05-05 18:51:32 UTC"

Creating date/times - 2

  • to get the current date or date-time you can use today() or now() - same as in Excel
  • from a string, from individual date-time components, from an existing date/time object
Sys.Date() ## base
[1] "2024-01-22"
today() ## lubridate
[1] "2024-01-22"
Sys.time() ## base
[1] "2024-01-22 14:06:57 CET"
now()   ## lubridate
[1] "2024-01-22 14:06:57 CET"

From strings - 1

  • using lubridate
ymd("2017-01-31")
[1] "2017-01-31"
mdy("Januar 31, 2017") ## NA: "Januar" is not an English month name
[1] NA
dmy("31-Jan-2017")
[1] "2017-01-31"
dmy("17/10/2018")
[1] "2018-10-17"

From strings - 2

ymd_hms("2017-01-31 20:11:59")
[1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
[1] "2017-01-31 08:01:00 UTC"

From strings - 3

date_times <- c("2021-04-25 14:30:00", "04/25/2021 02:30 PM", "2021.04.25 14:30")
parsed_dates <- parse_date_time(date_times, orders = c("ymd HMS", "mdy HM", "ymd HM"))

parsed_dates
[1] "2021-04-25 14:30:00 UTC" "2021-04-25 02:30:00 UTC"
[3] "2021-04-25 14:30:00 UTC"

Individual components

  • make_date()
  • make_datetime()
year<-2007
month<-11
day<-5
hour<-15
minutes<-7
make_date(year, month, day)
[1] "2007-11-05"
make_datetime(year, month, day, hour, minutes)
[1] "2007-11-05 15:07:00 UTC"

From other types

  • as_datetime()
  • as_date()
as_datetime(today())
[1] "2024-01-22 UTC"
as_date(now())
[1] "2024-01-22"

Lab

  • Use the appropriate lubridate function to parse each of the following dates:
- d1 <- "January 1, 2010"
- d2 <- "2015-Mar-07"
- d3 <- "06-Jun-2017"
- d4 <- c("August 19 (2015)", "July 1 (2015)")
- d5 <- "12/30/14" # Dec 30, 2014
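A possible set of answers (in an English locale):

```r
library(lubridate)

mdy("January 1, 2010")                        # d1
## [1] "2010-01-01"
ymd("2015-Mar-07")                            # d2
## [1] "2015-03-07"
dmy("06-Jun-2017")                            # d3
## [1] "2017-06-06"
mdy(c("August 19 (2015)", "July 1 (2015)"))   # d4
## [1] "2015-08-19" "2015-07-01"
mdy("12/30/14")                               # d5
## [1] "2014-12-30"
```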

Date-time components - 1

x<-ymd_hms("2019-05-05 19:23:13")
year(x)
[1] 2019
month(x)
[1] 5
mday(x)
[1] 5
yday(x)
[1] 125

Date-time components - 2

wday(x)
[1] 1
hour(x)
[1] 19
minute(x)
[1] 23
second(x)
[1] 13

Date-time components - 3

  • for month() and wday() you can set label = TRUE
  • abbr = FALSE to return the full name
month(x, label = TRUE)
[1] May
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(x, label = TRUE, abbr = FALSE)
[1] Sunday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Lab

  • On which day of the week were you born?
  • On which day of the week will you celebrate your 40th birthday?
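A sketch with an illustrative birth date (swap in your own):

```r
library(lubridate)

birthdate <- ymd("1975-09-11")   # illustrative date, reused from the slides

# Day of the week of the birth itself
wday(birthdate, label = TRUE, abbr = FALSE)

# 40th birthday: add a period of 40 years, then take the weekday
wday(birthdate + years(40), label = TRUE, abbr = FALSE)
```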

Durations - 1

  • when you subtract two dates, you get a difftime object
  • a difftime class object records a time span of seconds, minutes, hours, days, or weeks
  • lubridate provides an alternative which always uses seconds: the duration
   my_age<-today()-ymd("1975-09-11")
   my_age
Time difference of 17665 days
   as.duration(my_age)
[1] "1526256000s (~48.36 years)"

Durations - 2

dseconds(15)
[1] "15s"
dminutes(10)
[1] "600s (~10 minutes)"
dhours(c(12, 24))
[1] "43200s (~12 hours)" "86400s (~1 days)"  

Durations - 3

ddays(0:5)
[1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
[4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
dweeks(3)
[1] "1814400s (~3 weeks)"
dyears(1)
[1] "31557600s (~1 years)"

Periods - 1

  • periods are time spans but don’t have a fixed length in seconds, instead they work with “human” times, like days and months
seconds(15)
[1] "15S"
minutes(10)
[1] "10M 0S"
hours(c(12, 24))
[1] "12H 0M 0S" "24H 0M 0S"
days(7)
[1] "7d 0H 0M 0S"

Periods - 2

months(1:6)
[1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
[5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
weeks(3)
[1] "21d 0H 0M 0S"
years(1)
[1] "1y 0m 0d 0H 0M 0S"

Periods - 3

  • you can add periods (and durations) to dates
  • note: dyears(1) below is a duration of 365.25 days, so the result gains 6 hours; the period years(1) would give "2019-10-19"
ymd("2018-10-19") + dyears(1)
[1] "2019-10-19 06:00:00 UTC"

Periods - 4

  • convert a given number of seconds to a period
timeDiff<-now()-ymd_hms("2021-01-01 00:00:00")
timeDiffS<-as.duration(timeDiff)

# to period
seconds_to_period(timeDiffS)
[1] "1116d 13H 6M 57.9939050525427S"

Lab

  • import NYCTaxi.xlsx from link
  • convert pickup_datetime, dropoff_datetime to datetime format (e.g. dataNYC$pickup_datetime<-ymd_hms(dataNYC$pickup_datetime))
  • calculate the mean trip duration
  • use the seconds_to_period() function
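A possible sketch of this lab, assuming the file has been downloaded and the columns are named as above (dataNYC and the path are illustrative):

```r
library(readxl)
library(lubridate)

dataNYC <- read_xlsx("NYCTaxi.xlsx")   # illustrative local path

dataNYC$pickup_datetime  <- ymd_hms(dataNYC$pickup_datetime)
dataNYC$dropoff_datetime <- ymd_hms(dataNYC$dropoff_datetime)

# Mean trip duration: difftime -> duration -> human-readable period
mean_dur <- mean(dataNYC$dropoff_datetime - dataNYC$pickup_datetime)
seconds_to_period(as.duration(mean_dur))
```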

as.Date - 1

  • %Y: 4-digit year (1982)
  • %y: 2-digit year (82)
  • %m: 2-digit month (01)
  • %d: 2-digit day of the month (13)
  • %A: weekday (Wednesday)
  • %a: abbreviated weekday (Wed)
  • %B: month (January)
  • %b: abbreviated month (Jan)

as.Date - 2

    as.Date("09/28/2008", format = "%m / %d / %Y")
[1] "2008-09-28"

Locale - 1

  • change locale
   currentL<- Sys.getlocale()
   Sys.setlocale("LC_ALL","German")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
   date<-ymd("20180301")
   format(date, "%d-%m-%y")
[1] "01-03-18"
   format(date, "%d-%b-%Y")
[1] "01-Mrz-2018"
   format(date, "%d-%B-%y")
[1] "01-März-18"

Locale - 2

   format(date, "%b %d, %Y")
[1] "Mrz 01, 2018"
   format(date, "%B %d, %Y")
[1] "März 01, 2018"
   wday(date, label = TRUE, abbr=FALSE)
[1] Donnerstag
7 Levels: Sonntag < Montag < Dienstag < Mittwoch < Donnerstag < ... < Samstag
   Sys.setlocale("LC_ALL",currentL)
[1] ""

Lab

Use Sys.getlocale() and Sys.setlocale() to:

  • display today’s month in Czech
  • display today’s day in Swedish
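A sketch, assuming Windows-style locale names (on Linux/macOS use e.g. "cs_CZ.UTF-8" and "sv_SE.UTF-8"):

```r
currentL <- Sys.getlocale()

Sys.setlocale("LC_TIME", "Czech")      # locale name is platform-dependent
format(Sys.Date(), "%B")               # today's month in Czech

Sys.setlocale("LC_TIME", "Swedish")
format(Sys.Date(), "%A")               # today's weekday in Swedish

Sys.setlocale("LC_ALL", currentL)      # restore the original locale
```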

Lab

In this exercise you will work with the date, “1930-08-30”, Warren Buffett’s birth date! Mind the locale language!

  • Use as.Date() and an appropriate format to convert “08,30,1930” to a date (it is in the form of “month,day,year”)
  • Use as.Date() and an appropriate format to convert “Aug 30,1930” to a date
  • Use as.Date() and an appropriate format to convert “30aug1930” to a date
  • also solve previous tasks with lubridate functions
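A possible set of answers (in an English locale, so the month names parse):

```r
library(lubridate)

as.Date("08,30,1930", format = "%m,%d,%Y")
## [1] "1930-08-30"
as.Date("Aug 30,1930", format = "%b %d,%Y")
## [1] "1930-08-30"
as.Date("30aug1930", format = "%d%b%Y")
## [1] "1930-08-30"

# The same three with lubridate
mdy("08,30,1930"); mdy("Aug 30,1930"); dmy("30aug1930")
```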

Update object

date <- ymd("2009-02-10")
update(date, year = 2010, month = 1, mday = 1)
[1] "2010-01-01"
update(date, minute = 10, second = 3)
[1] "2009-02-10 00:10:03 UTC"